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Abstract 

Transcription factor (TF) binding to DNA can be modeled in a number of different ways. It is highly debated which 
modeling methods are the best, how the models should be built and what can they be applied to. In this study a 
linear /c-mer model proposed for predicting TF specificity in protein binding microarrays (PBM) is applied to a high- 
throughput SELEX data and the question of how to choose the most informative /c-mers to the binding model is 
studied. We implemented the standard cross-validation scheme to reduce the number of /c-mers in the model and 
observed that the number of /c-mers can often be reduced significantly without a great negative effect on 
prediction accuracy. We also found that the later SELEX enrichment cycles provide a much better discrimination 
between bound and unbound sequences as model prediction accuracies increased for all proteins together with 
the cycle number. We compared prediction performance of /c-mer and position specific weight matrix (PWM) 
models derived from the same SELEX data. Consistent with previous results on PBM data, performance of the k- 
mer model was on average 9%-units better. For the 15 proteins in the SELEX data set with medium enrichment 
cycles, classification accuracies were on average 71% and 62% for /c-mer and PWMs, respectively. Finally, the /c-mer 
model trained with SELEX data was evaluated on ChlP-seq data demonstrating substantial improvements for some 
proteins. For protein GATAl the model can distinguish between true ChlP-seq peaks and negative peaks. For 
proteins RFX3 and NFATCl the performance of the model was no better than chance. 



Introduction 

Many proteins bind DNA and do that in a sequence spe- 
cific way. These DNA-binding proteins include transcrip- 
tion factors (TF), among others, which have an important 
function in regulating gene expression by affecting tran- 
scription and chromatin state. Given the fundamental 
role of TPs in cellular processes and difficulty in measur- 
ing their binding sites, computational analysis of binding 
sites can provide tremendous help in understanding 
complex regulatory mechanisms [1]. The DNA prefer- 
ence of DNA-binding proteins can be modelled with dif- 
ferent computational methods [2]. All methods require 
known binding sites or data from biological experiments, 
such as gene expression profiling, chromatin immuno- 
precipitation followed by sequencing (ChlP-seq), protein 



* Correspondence: juhani.kahara@aalto.fi 

^Department of Information and Computer Science, Aalto University School 
of Science, FI-00076 Aalto, Finland 

Full list of author information is available at the end of the article 



binding microarrays (PBM) or systematic evolution of 
ligands by exponential enrichment (SELEX) followed by 
sequencing, to build the models. 

The modeling method which has become the major 
paradigm is the position specific weight matrix (PWM) 
model [3,4]. Position weight matrices are constructed by 
using sequences from an experiment or automatically 
aligning binding sites within longer sequences [5,6]. The 
number of each base is calculated in each position of the 
alignment, and then each base is assigned a score based 
on the counts. This way each position treats the nucleo- 
tides independently from the other positions: the score is 
based only on the frequency of the base in that certain 
position. Consequently, PWMs have been criticized that 
they might lose some important dependencies between 
nearby nucleotides, but PWMs provide a very easy and 
intuitive modeling framework and, moreover, thousands 
of different PWMs exist in several databases [7,8]. 
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k-mer models offer a different perspective in DNA- 
protein interactions. In /c-mer models each /c-mer, a 
nucleotide sequence of length k, is given a score that 
describes the binding affinity of the protein towards 
the k-mer. The score can be assigned to each k-mer 
for example by utilizing the distribution of k-mers in the 
binding sites, or solving the score using modeling 
approaches, k-mer models help in capturing depedencies 
between the nucleotide positions in the binding. A recent 
comparison study demonstrated that k-mer models can 
provide equal or better performance compared to tradi- 
tional PWM models, particularly so on high-throughout 
PBM data [2]. On contrary, performance of k-mer models 
relative to PWM-based methods was found to severely 
degrade when tested on in vivo data. 

Here we focus on the k-mer model by Annala et al [9] 
which was ranked the first in the original DREAMS chal- 
lenge and was among the two most accurate methods in a 
comprehensive comparison by Weirauch et al, [2]. We 
apply the k-mer model to 15 proteins for which SELEX 
data is available [10] and, thus, report performance of the 
k-mer model on another high-throughput protocol com- 
plementary to PBM data used in [2] . The performance is 
also compared against PWM models derived from the 
same data. 

To improve robustness and generalization of k-mer 
models we studied different feature selection strategies for 
choosing the most informative k-mers to the model. The 
first feature selection strategies were to select the most fre- 
quent or the most enriched k-mers. In addition we imple- 
mented the cross-validation scheme for choosing the 
k-mers. Finally, the k-mer model trained with SELEX data 
was evaluated on ChlP-seq data with varying results. 

Materials and methods 

SELEX data 

In SELEX protocol [10] the experiment starts with a large 
pool of all different DNA-sequences of fixed length. The 
protein of interest is introduced into the pool, and 
the protein will recognize and bind its target sequences. 
Then an antibody is added to the solution and the 
protein-DNA complexes are immunoprecipitated. The 
bound sequences are then amplified and sequenced. 
The process is repeated by using the bound sequences as 
a new initial pool of sequences. The protocol produces 
massive amount of bound sequences (reads) for different 
iterations which can be used to construct models for 
describing DNA-protein interactions as well as to evalu- 
ate their predictive performance. 

The SELEX data from [10] contained enriched reads 
from the SELEX process for different enrichment cycles. 
To account for non-specific carryover of DNA from pre- 
vious cycles, a specially designed multinomial method 
was used to construct PWMs from high-throughput 



SELEX data in [10]. The obtained PWMs were found to 
be very similar with those obtained by the standard 
MEME [5], thus providing us a good benchmark. The 
PWM-models were however constructed using only one 
cycle, and only that cycle was taken into consideration in 
the performance comparison between k-mer and PWM 
models. Cycles that were included consisted between 
9000 and 175000 reads with average being 62000 reads. 
These sets of reads were randomly divided to training 
and test sets with ratio 7:3. The training set was used for 
training the k-mer model, and the test set was used in 
the performance analysis and comparisons. Random 
reads (negative set) were generated to match the number 
of enriched reads for each training and testing set. The 
choice to use uniformly random sequences instead of 
picking random genomic locations was justified by the 
fact that in the original experiment the initial pool of 
nucleotide sequences included evenly all 14-mers. The 
model performance and comparison was evaluated by a 
classification task where the test data set consisting of 
enriched reads not used in the training and equal amount 
of random reads. 

Description of the linear /c-mer model 

The linear k-mer model presented by Annala et al, [9] 
assigns an affinity score to each k-mer and the total 
binding affinity to a certain sequence is the sum of the 
binding affinities of the k-mers in that sequence. In the 
original work this affinity was measured by signal 
strength of a spot in PBM. The affinity score for the 
k-mers can be solved from the linear model Ax = Z? + £, 
where A is the design matrix, x is the affinities (col- 
umn vector), h is signal strength and E represents 
noise. In the design matrix each row represents a 
probe and columns are the variables, the k-mers. The 
Aij element of the matrix is one if the k-mer corre- 
sponding to the /th column can be found in the /th 
probe, zero otherwise. 

The k-mer model is trained by solving the linear equa- 
tion Ax = h. In the original application for PBMs the h 
vector resulting from the multiplication consisted of 
probe intensities. For SELEX experiments no intensity 
data is measured but instead the h vector is a binary vec- 
tor where one denotes a detected enriched {i,e, bound) 
read and zero denotes a random read (unbound). The 
training requires both bound and unbound sequence 
reads in order to produce useful predictive models. In 
this model training scheme the reads were treated as 
equally important, which might not be true, as the bind- 
ing affinity of a protein will vary between individual 
reads. However, given sufficient sequencing depth, the 
substantial number of reads could compensate this 
assumption because strongly bound k-mers would occur 
more frequently in the data. 
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The choice of /c-mers 

A key question with the linear k-mer model is which 
k-meis should be included into the model. To begin 
with, we chose two ways to do this. First way was to 
choose the most common k-mers in the whole data, the 
second way was to choose the most enriched /c-mers. 
The enriched k-meis were picked from the k-mei table 
which lists the counts of each k-mer in the set of bound 
reads and random unbound reads and sorts them based 
on either the difference or fold change of those counts. 
The higher the difference or fold change, the more 
enriched the /c-mer in question is. The standard cross- 
validation scheme was also implemented. First we 
started with a predefined number of most frequent 
/c-mers, and then we started removing those k-mers 
from the model, whose removal improved the results 
most or had the smallest negative effect in 10-fold 
cross-validation classification within the training set, 
essentially implementing a wrapper type of feature selec- 
tion technique. The number of k-mers in the beginning 
should not be too high, because cross-validation compu- 
tation gets heavy easily. Note that, to avoid the selection 
bias [11], the final prediction performance for the cross- 
validated model was assessed on the separate testing 
data used neither during the model training nor during 
the feature selection. 

Results 

The k-mer models trained as explained above and the 
PWM models taken from Jolma et al. [10] were used in 
classifying the reads in the testing set. Note that the PWM 
models were derived using a combination of training and 
testing data, which may slightly positively influence the 
PWM results. The classification using PWMs was con- 
ducted by scanning the reads in the testing set with the 
given model and the maximum of the scores was assigned 
to that read. The reads were classified using threshold that 
was found optimal in the training set. Classification accu- 
racy is estimated as the proportion of reads that are classi- 
fied correctly. The confidence intervals are normal 
approximation intervals. 

The classification accuracies together with 95% confi- 
dence intervals are shown in Figure 1. The k- mer model 
clearly outperforms the PWM models as can be seen 
from the figure. In the first bars the accuracy of the 
k-mer model is the accuracy using the optimal number of 
most frequent k-mers. The average classification accuracy 
is 71% for the k-mer model and 62% for the PWM mod- 
els. In the second and third bars the accuracy is obtained 
with k-mer model using the most enriched k-mers. Sur- 
prisingly k-mer model performance is better, when 
choosing the most frequent /c-mers instead of the most 
enriched k-mers. This might indicate that assigning high 
affinity scores to k-mers responsible for binding as well 



as giving negative affinity scores to frequent k-mers that 
are not part of the binding are important. 

The effect of the number of /c-mers in the model is 
shown in Figure 2. The average classification accuracy 
increases sharply when /c-mers are added to the model, 
and the accuracy reaches its maximum at about 600 
k-mers. Using more than 600 k-mers has little effect on 
results. However, for individual proteins the classifica- 
tion accuracy seems to peak at around 600 or 1500 
k-mers. In later cycles the data can be classified with 
great accuracy by using only one k-mer. For protein 
XBPl it suffices to include only k-mer ACGT to the 
model, and the data can be classified with accuracy of 
91%. It is worth noting, that the k-mer is its reverse 
complemented and can be therefore detected from both 
strands. 

Selection of /c-mers improves prediction accuracy 

The cross-validation gives a slight improvement to the 
results. If we start with for example 200 k-mers^ we can 
end up in a set of 100 /c-mers which is better than tak- 
ing just the 100 most frequent /:-mers-the main motiva- 
tion for feature selection in discriminatory analysis. The 
change in classification accuracy, when starting with 100 
k-mer Sf is shown for four proteins in Figure 3. The blue 
line represents the accuracy in the 10-fold cross-valida- 
tion within the training set, and red line is the classifica- 
tion accuracy in the testing set. Green line represents 
the classification accuracy, when the model is trained 
using the equal number of /c-mers that are most fre- 
quent in the data. Horizontal axis represents the num- 
ber of k-mers in the model. 

The binding prediction results vary between proteins. 
For example for PRDMl and HSF2 the cross-validation 
scheme can choose k-mers producing much better clas- 
sification results. The data can be classified reliably with 
only seven A:-mers (HSF2). With FOXJ3 taking the most 
frequent k-mers is equally good as choosing the k-mers 
with cross-validation. This means that for those proteins 
the most frequent k-mers are truly the most important 
ones in classification. On the other hand, for example 
for HSF2, for which the results are better using the 
cross-validation, the most important k-mers are some- 
what longer and therefore also less frequent k-mers. 

Cross-validation starting with higher number of /c-mers 
yields similar results (Figure 4). Sometimes the most fre- 
quent k-mers yield equal or even slightly better results 
than cross-validation scheme, but for some proteins the 
cross-validation introduces great advantages. For exam- 
ple for HSF2 the cross-validation clearly improves results 
as the classification accuracy is maximized around 150 
/c-mers and difference between the cross-validated and 
the enrichment method increases even more for lower 
smaller number of k-mers. Moreover, especially for 
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Figure 2 Classification accuracy as a function of number of /c-mers in the model. The test set classification accuracy plotted as a function 
of number of /c-mers in the model. The 95% normal approximation confidence intervals are plotted around each curve as grey area. 
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Figure 3 Illustration of classification accuracy using the CV-scheme for four proteins. The test set classification accuracy as a function of 
number of ^c-mers, when the /c-mers are chosen with the CV-scheme. The CV is started from 100 most frequent k-mets. The 95% normal 
approximation confidence intervals are plotted around the curves. 
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Figure 4 Illustration of classification accuracy using the CV-scheme when starting with greater number of fc-mers. The test set 
classification accuracy as a function of number of /c-mers, when the k-mets are chosen with the CV-scheme. The CV is started from 450 and 600 
most frequent /c-mers for proteins HSF2 and F0XJ3. The 95% normal approximation confidence intervals are plotted around the curves. 
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smaller number of /c-mers the cross-validated feature 
selection provides significantly better results than the 
standard approach (Figures 3 and 4). 

It was also investigated how the chosen SELEX 
enrichment cycle affects the classification accuracy. The 
results are plotted in Figure 5. It is clear that the perfor- 
mance is higher at later cycles. This is quite intuitive 
because the set of sequences should get more homoge- 
nous as the enrichment process proceeds. It is also 
noticeable how classifying the first SELEX rounds is 
much more difficult than the later cycles. At later cycles 
there is also much more variability in the results 
between proteins. The round used in the performance 
comparison is marked with a star. 

The cross-validation scheme does not choose necessa- 
rily those /c-mers to the model that are important origin- 
ally with higher number of /c-mers. It was investigated if 
the highest affinity /c-mers would be included late in the 
cross-validation. There was little overlap between the ten 
last /c-mers in the cross-validation and ten most impor- 
tant (highest affinity score) /c-mers in the most-frequent 
approach. Nevertheless the /c-mers are quite similar to 
each other in the sense that the top /c-mers in the differ- 
ent methods align relatively well to the PWMs: examples 
of FOXJ3 and HSF2 are shown in Figures 6 and 7. The 
/c-mers aligned to the logo in the left are from the latest 



rounds in the cross-validation and in the right the /c-mers 
are top-affinity /c-mers from the most-frequent approach. 
Both /c-mers and their reverse complements are taken 
into account when aligning the top /c-mers to the motifs. 

Application to ChlP-seq data 

We also assessed the applicability of the linear /c-mer 
model, trained on SELEX data, to ChlP-seq data. That is, 
the model was still trained with the SELEX data but per- 
formance was evaluated using data from ChlP-seq experi- 
ments. This required finding suitable data sets for some of 
the proteins for which SELEX data was produced in [10]. 
Three data sets, for proteins RFX3, NFATCl and GATAl, 
were used. Data for RFX3 and NFATCl is described in 
Jolma et al [10] and data for GATAl was taken from the 
ENCODE data set. Each data set contained true peaks 
supposedly bound by the protein of interest, and negative 
peaks, locations that were chosen randomly and are most 
likely unbound by the protein. 

Area under the curve (AUC) metric was chosen for 
performance evaluation: the classification threshold pre- 
viously identified from SELEX data might not be applic- 
able to other types of data, because the reads to be 
classified are longer. AUC reports the probability that a 
true bound sequence will score higher than a random 
non-bound sequence. AUC of 1 corresponds to perfect 



1r 



K-mer model accuracy on different cycles 



0.95 



0.9 



o 

(0 

o 
o 

CO 

■D 
O 

E 



0 
E 

)^ 0.7 



0.65 



0.85 



0.8 



0.75 



0.6 



F0XJ3 
HSF2 
NFATC1 
XBP1 




0.55 

0 1 2 3 4 5 

Enrichment round 

Figure 5 Classification accuracy in different enrichment cycles. The classification accuracy from different SELEX enricliment cycles for five 
proteins. The 95% normal approximation confidence intervals are plotted around the accuracies. 



Kahara and Lahdesmaki BMC Bioinformotics 2013, 14(Suppl 10):S2 
http://www.biomedcentral.eom/1 471 -21 05/1 4/S1 0/S2 



Page 7 of 9 



A 




A 
c 










A 


A 


C 


A 




A 


A 


A 


A 








A 


A 


C 


A 


A 


T 








A 


C 


A 


A 


C 










G 


A 


T 


A 


A 






A 


A 


A 


C 


A 


A 


T 


C 


G 














C 


C 


C 


C 




G 


C 


A 


A 














A 


A 


A 


T 








A 


A 


T 


A 


A 


A 


A 


A 


A 


A 


A 






A 


T 


A 


A 


T 












C 


A 


A 


C 


A 






A 


A 


A 


G 


T 










A 


A 


T 


G 


G 


A 


A 


A 


A 








G 


G 


A 


A 








G 


A 


A 


G 










G 


T 


A 


A 


A 


C 





Figure 6 Top /c-mers aligned to F0XJ3 logo. The top /c-mers chosen with the CV-scheme (left) and the top affinity /c-mers from the most 
frequent approach (right) aligned to F0XJ3 sequence logo. 



discovery of the true peaks without making any false 
positive predictions, and AUG of 0.5 can be reached by 
classifying the peaks randomly. Mann- Whitney confi- 
dence intervals are suitable for estimating the confidence 
interval of the AUG metric [12]. From each data set AUG 
was calculated using the entire peak regions that can 
span several hundred nucleotides and for the centers of 
each peak (50nt and lOOnt regions). 



For protein NFATGl AUG is plotted against the num- 
ber of /c-mers in the model in Figure 8. As can be seen 
the AUG stays quite steadily in the range of 0.5 meaning 
that the /c-mer model fails to distinguish between true 
peaks and negative peaks. 

For GATAl the results are more interesting, as in some 
cases the /c-mer model provides great predictive power. 
The GhlP-seq peaks can be quite accurately distinguished 
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Figure 7 Top /c-mers aligned to HSF2 logo. The top /c-mers chosen with the CV-scheme (left) and the top affinity /c-mers from the most 
frequent approach (right) aligned to HSF2 sequence logo. 
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Figure 8 AUC as a function of number of k-mers in the model for two NFATC1 ChlP-seq samples, (left) AUCs when the true binding sites 
are tal<en to be within 100 nucleotides around the summit of the ChlP-seq peak, (right) The same as (left) except each binding site is taken to 
be the whole ChlP-seq peak region. The 95% Mann-Whitney confidence intervals plotted around the curves. 



from false peaks as the AUC rises close to 0.9 with rela- 
tively high number of k-mers in the model (Figure 9, left). 
The summits in turn can be distinquished from false 
peaks by using only small number of k-mers selected with 
the cross-validation scheme (Figure 9, right). There is 
however somewhat controversial phenomenon of AUC 
dropping significantly below 0.5 when using greater num- 
ber of k-mers in the model. Whether its a property of the 
data or the k-mer model remains to be investigated. 



Nevertheless, taken together, our results indicate that the 
previously reported poor performance of the k-mer model 
on in vivo data can be improved using careful feature 
selection strategies. 

Discussion 

Feature selection in k-mer models can be approached in 
many different ways. Our results indicate that it is prefer- 
able to include the most frequent k-mers in the model 
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Figure 9 AUC as a function of number of /c-mers in the model for GATA1 ChlP-seq samples. AUC as a function of number of /c-mers for GATAl, 
when /c-mers are selected with the CV-scheme. In left figure (red) the AUC is plotted when using the most frequent /c-mers. In right figure the AUC is 
calculated when only the centers of the ChlP-seq peaks are used. The 95% Mann-Whitney confidence intervals plotted around the curves. 
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instead of choosing the most enriched ones. This 
approach inevitably favours shorter /c-mers because they 
are statistically more likely to appear frequently in 
sequence data. We also observed that, for SELEX data, 
increasing the number of /c-mers in the model does not 
improve predictions after certain point (about 600 
/c-mers). This point most likely changes depending on 
which kind of data the model is applied to. 

The number of /c-mers can be reduced without great 
negative effects in prediction accuracy using the standard 
cross-validation scheme. Consequently, this decreases the 
number of parameters which need to be estimated for 
the /c-mer models making them more attractive and 
applicable binding prediction model. 

Although it is quite clear that the k-mer model outper- 
forms widely used PWM-models within the SELEX data 
set, the performance of the k-mer model with in vivo data 
still remains an open question. For two data sets (proteins 
NFATCl and RFX3) the /c-mer model failed to distinquish 
between true binding sites and unbound sites. For GATAl 
the results however seem very promising as the AUG of 
close to 0.9 was reached. 
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